In this lab, we will continue the steps of a geostatistical analysis, using the set of precipitation data for Switzerland, and the data set of soil samples from a floodplain near the Meuse river in the Netherlands.
Before starting the lab, you will need to set up a new folder for your working directory. Go to your geog6000 folder now and create a new folder for today’s class called lab11. The following files will be used in this lab, all available on Canvas:
You should have these files from last week’s lab, but if not you will need to download these files from Canvas, and move them from your Downloads folder to the datafiles folder that you made previously. Make sure to unzip the zip files so that R can access the content. Note that on Windows, you will need to right-click on the file and select ‘Extract files’.
Now start RStudio and change the working directory to lab11. As a reminder, you can do this by going to the [Session] menu in RStudio, then [Change working directory]. This will open a file browser that you can use to browse through your computer and find the folder.
You will also need several packages to carry out all the steps listed here: sf, stars, RColorBrewer and gstat. Again, you should have these from last weeks lab, but if not make sure these are installed these before starting.
pkgs <- c("gstat",
"stars",
"RColorBrewer",
"sf")
install.packages(pkgs)
library(ggplot2)
library(gstat)
library(RColorBrewer)
library(sf)
library(stars)
library(viridis)
With all the examples given, it is important to not just type in the code, but to try changing the parameters and re-running the functions multiple times to get an idea of their effect. Help on the parameters can be obtained by typing help(functionname) or ?functionname.
By default gstat assumes that your data are in a Cartesian projection, unless the Spatial* object containing the data has projection metadata specifying that it is on a spherical (lat/lon) coordinate system. Without this, longitude and latitude coordinates will be treated as Cartesian, and distances will be incorrectly calculated for variogram analysis and kriging. We can illustrate this with the Oregon dataset. Load this and check the associated projection (this should read ‘NA’):
oregon = st_read("../datafiles/oregon/oregontann.shp", quiet = TRUE)
st_crs(oregon)
## Coordinate Reference System: NA
Let’s make a quick plot of the temperature data:
# names(oregon)
plot(oregon["tann"], pch = 16)
This shows a fairly clear pattern with warmer temperatures at lower elevations and closer to the coast. We’ll now make a variogram for to explore this pattern (first load the gstat package):
or_vgm <- variogram(tann ~ 1, oregon)
plot(or_vgm)
The distance lags are calculated in degrees. To correct this, we can either project the data or simply specify that these data are spherical coordinates. In general, the second of these is a better approach, unless you are working in a very small area, as all projections will distort distances and/or directions over large areas. Here, we set the st_crs to EPSG 4326, which is the WGS84 standard:
st_crs(oregon) <- 4326
Now if we remake the variogram, gstat recognizes this as spherical coordinates, and calculates distances as great circle distances in km, rather than Euclidean degrees, reducing the bias in distances.
or_vgm <- variogram(tann ~ 1, oregon)
plot(or_vgm)
This code is taken from last week’s lab and will load and convert all the necessary files for the Swiss precipitation data:
## Precipitation data
swiss <- read.csv("../datafiles/swiss_ppt/swiss_ppt.csv")
swiss.sf <- st_as_sf(swiss,
coords = c("x", "y"),
crs = 2056)
## Elevation grid
swiss.dem <- read_stars("../datafiles/swiss_ppt//swiss_dem.asc")
st_crs(swiss.dem) <- 2056
## Swiss border
countries <- st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp", quiet = TRUE)
swiss.bord <- subset(countries, NAME == "Switzerland")
swiss.bord <- st_transform(swiss.bord, 2056)
Make a simple plot of precipitation amounts:
plot(swiss.sf["ppt"], reset = FALSE, pch = 16)
plot(st_geometry(swiss.bord), add = TRUE)
Alternatively, we can use the ggplot2 library:
ggplot() +
geom_sf(data = swiss.bord) +
geom_sf(data = swiss.sf, aes(col = ppt), size = 2.5) +
scale_color_viridis_c(direction = -1) +
theme_bw()
Now load the Meuse datasets:
meuse <- st_read("../datafiles/meuse/meuse.shp", quiet = TRUE)
st_crs(meuse) <- 28992
meuse.riv <- st_read("../datafiles/meuse/meuseriv.shp", quiet = TRUE)
st_crs(meuse.riv) <- 28992
meuse.grid <- st_read("../datafiles/meuse/meusegrid.shp", quiet = TRUE)
st_crs(meuse.grid) <- 28992
meuse.grid <- st_rasterize(meuse.grid["dist"], dx = 40, dy = 40)
A histogram of the Swiss precipitation data shows that it is right-skewed. Before carrying out any analysis, we’ll log transform it to normalize the distribution. Note that as there are several sites that have zero precipitation we also add a small positive value to these sites to make the log transformation possible.
hist(swiss.sf$ppt)
swiss.sf$lppt <- log(swiss.sf$ppt + 1e-1)
Kriging can be extended to include covariates to improve prediction. One of the most flexible methods is kriging with an external drift, where the drift refers to any covariate that has been recorded both at the sampling locations and at the prediction locations. A scatterplot of the precipitation against the elevation of each site shows a weak negative relationship, with generally higher values at lower elevations:
plot(lppt ~ elev, swiss.sf)
We will use this model together with elevation data from the DEM to perform kriging with an external drift on the precipitation data. As we are incorporating a covariate in our model, we first need to remake the precipitation variogram, so that the variogram takes this into account:
ppt.var <- variogram(lppt ~ elev, swiss.sf)
plot(ppt.var, plot.numbers = TRUE)
Now let’s fit a model as we did before:
modNugget <- 0.05
modRange <- 100000
modSill <- 0.75
ppt.vgm1 <- vgm(psill = modSill,
"Sph",
range = modRange,
nugget = modNugget)
ppt.vgm2 <- fit.variogram(ppt.var, ppt.vgm1)
plot(ppt.var, ppt.vgm2, main = "Swiss precip. variogram")
We will now use this variogram to interpolate the precipitation data. We use the same function as in the previous lab (krige()), but now specify ‘elev’ as an independent variable in the model formula. This requires that both the spatial points (swiss.sf) and the new locations (swiss.dem) have a variable called `elev’, so let’s check this first:
## Check to see that both the data and grid have elev variables
names(swiss.sf)
## [1] "id" "ppt" "elev" "geometry" "lppt"
names(swiss.dem)
## [1] "swiss_dem.asc"
This shows that the swiss.dem variable was named using the original file name. let’s rename this
names(swiss.dem) <- "elev"
Now we can go ahead and run the model:
## kriging with external drift
ppt.pred.ked <- krige(lppt ~ elev,
swiss.sf,
swiss.dem,
ppt.vgm2)
## [using universal kriging]
Plot the new predictions:
# names(ppt.pred.ked)
my.pal = brewer.pal(9, "Blues")
plot(ppt.pred.ked["var1.pred"],
col = my.pal,
main = "Swiss log precipitation (KED)")
Note that we can also back-transform to precipitation in mm by using exp():
ppt.pred.ked$ppt <- exp(ppt.pred.ked$var1.pred)
plot(ppt.pred.ked["ppt"],
col = my.pal,
main = "Swiss precipitation (KED)")
As before, we can estimate the prediction skill of the model using cross-validation:
ppt.cv.ked <- krige.cv(lppt ~ elev,
swiss.sf,
ppt.vgm2,
nmax = 40,
nfold = 5)
## RMSE
sqrt(mean(ppt.cv.ked$residual^2))
## [1] 0.4114167
## R2
cor(ppt.cv.ked$observed, ppt.cv.ked$var1.pred)^2
## [1] 0.7960134
Regression kriging provides a more flexible approach to including covariates than universal kriging or external drift kriging, but requires a little more work. The idea is that rather than trying to incorporate a potentially complicated relationship in the model, this is modeled separately, then the residuals are interpolated using simple kriging. A final estimate at each new location can then be made by adding the predicted value from the original model to the interpolated residual. This opens up the possibility of using any regression technique with the covariate(s) to model the larger, structural trends, and then using simple kriging to model the deviations from this trend.
As a simple example, we will build a linear model of precipitation using elevation as the covariate, and use the residuals from this as the basis for kriging. Note that as we have specified the covariate in the regression model, we no longer need to include it in the variogram or the krige() function.
fit1 <- lm(lppt ~ elev, swiss.sf)
swiss.sf$resid <- residuals(fit1)
resid.var <- variogram(resid ~ 1, swiss.sf)
plot(resid.var)
Now fit a variogram model to this:
resid.vgm <- vgm(0.75, "Cir", 100000, 0.05)
plot(resid.var, resid.vgm)
We now interpolate the residuals (I haven’t fit the variogram model to the sample data, but feel free to do so). Note that as these are residuals, the mean is assumed known (\(=0\)), so we use simple kriging for the interpolation. The parameter beta is used to set the value of the mean:
resid.sk <- krige(resid ~ 1,
swiss.sf,
swiss.dem,
resid.vgm,
nmax = 40,
beta = 0)
## [using simple kriging]
my.pal <- brewer.pal(9, "PRGn")
plot(resid.sk["var1.pred"],
col = my.pal,
main = "Swiss precipitation residuals (RK)")
To make the final estimates, we first predict the precipitation using the simple linear model and the elevation values on the DEM, and store this in the stars DEM object:
swiss.dem$ppt.lm <- predict(fit1, swiss.dem)
Now we add the interpolated residuals to these values:
swiss.dem$ppt.rk <- swiss.dem$ppt.lm + resid.sk$var1.pred
And finally visualize the predictions:
my.pal <- brewer.pal(9, "Blues")
plot(swiss.dem["ppt.rk"],
col = my.pal,
main = "Swiss precipitation (Regression Kriging)")
We can, of course, back-transform the data to the non-log scale using exp(). While the results here are not that much different to the kriging with an external drift listed above, the method is much more flexible. We could replace the linear model with generalized linear models, additive models, mixed-effects models and even machine learning methods.
Indicator kriging is used to interpolate binary variables as probabilities. It can be used to estimate whether the variable of interest will be over or below a given threshold at a new location, or the probability that a new location will have a binary or categorical variable. In either case, the method consists quite simply of interpolating binary values (0/1) using ordinary kriging. As this uses the same function as before, you can include a trend or external drift if necessary.
We’ll first use this method to find all locations in Switzerland with over 40mm of rainfall. Here, we create a new variable in the swiss.sf data frame, which is whether or not the station had \(>\) 40mm rainfall, and use spplot() to make a quick figure.
swiss.sf$ppt40 <- swiss.sf$ppt > 40
plot(swiss.sf["ppt40"],
pch = 16,
main = "PPT > 40mm")
Now we proceed as in the previous lab:
ppt40.var <- variogram(ppt40 ~ 1, swiss.sf)
plot(ppt40.var)
ppt40.vgm <- vgm(0.035, "Sph", 40000, 0.01)
plot(ppt40.var, ppt40.vgm)
ppt40.vgm2 <- fit.variogram(ppt40.var, ppt40.vgm)
plot(ppt40.var, ppt40.vgm2)
ppt40.ik <- krige(ppt40 ~ 1,
swiss.sf,
swiss.dem,
ppt40.vgm2,
nmax = 40)
## [using ordinary kriging]
plot(ppt40.ik["var1.pred"])
Note that these are not true probabilities (some values \(<0\) are obtained). For the purposes of geostatistical interpolation, however, these are considered as close to being probabilities and are often corrected to between 0 and 1, which we’ll do next. Indicator simulation (see below) offers a method to interpolate true probabilities.
First, we use the which() function to find all pixels with a value below zero and reset this to zero. We then plot using one of the viridis color palettes (you will need to make sure this is installed):
ppt40.ik$var1.pred[which(ppt40.ik$var1.pred < 0)] <- 0
my.pal <- rev(viridis::magma(10))
plot(ppt40.ik["var1.pred"],
col = my.pal,
breaks = seq(0, 1, length.out = 11),
main = "P(ppt > 40mm)")
The Meuse data set contains soil type for each of the sampling sites, split into three classes. Use the spplot() function to look at their spatial distribution:
plot(meuse["soil"], pch = 16, reset = FALSE)
plot(meuse.riv, add = TRUE, col = NA)
The individual categories can be interpolated to new locations using indicator kriging. As before we start by making and modeling the variogram, then interpolate onto a grid.
Star by making and modeling the sample variogram. Note that rather than creating a new variable, we use the I() function, which tells R to create a new variable internally in a model, here a binary value where soil class 1 equals 1, and other classes equal zero:
s1.var <- variogram(I(soil == 1) ~ 1, meuse, cutoff = 2000)
s1.vgm <- vgm(psill = 0.25, model = "Sph", range = 900, nugget = 0.1)
s1.vgm <- fit.variogram(s1.var, s1.vgm)
plot(s1.var, s1.vgm, main = "Soil class 1")
We now interpolate using the krige() function:
s1.ik <- krige(I(soil == 1) ~ 1, meuse, meuse.grid, s1.vgm)
## [using ordinary kriging]
my.pal <- brewer.pal(9, "Greens")
plot(s1.ik["var1.pred"],
col = my.pal,
main = "P(Soil == 1)",
reset = FALSE)
plot(meuse.riv, col = NA, add = TRUE)
Now do the same for the other two soil classes, to get interpolated probabilities for class 2 (s2.ik) and 3 (s3.ik).
s2.var <- variogram(I(soil == 2) ~ 1, meuse, cutoff = 2000)
vgm_model <- vgm(psill = 0.25, model = "Sph", range = 900, nugget = 0.1)
s2.vgm <- fit.variogram(s2.var, model = vgm_model)
plot(s2.var, s2.vgm, main = "Soil class 2")
s2.ik <- krige(I(soil == 2) ~ 1, meuse, meuse.grid, s2.vgm)
## [using ordinary kriging]
s3.var <- variogram(I(soil == 3) ~ 1,
meuse,
cutoff = 2000)
s3.vgm <- fit.variogram(s3.var, model = vgm_model)
plot(s3.var, s3.vgm, main = "Soil class 3")
s3.ik <- krige(I(soil == 3) ~ 1, meuse, meuse.grid, s3.vgm)
## [using ordinary kriging]
Once you have the probabilities for all three classes, we can use these to estimate the most probable class at each new location. We do this in three steps: first, combine the individual probability interpolations into a single matrix; second, find for each row, the column with the highest probability using max.col() and assign the output as a new variable in meuse.grid; third, plot the results.
soil.prob <- cbind(c(s1.ik$var1.pred),
c(s2.ik$var1.pred),
c(s3.ik$var1.pred))
meuse.grid$soil.pred <- max.col(soil.prob)
my.pal <- brewer.pal(3, "Set2")
plot(meuse.grid["soil.pred"], col = my.pal)
This can be extended to larger numbers of classes, as long as there is sufficient data to produce a variogram for each one, and that you have the patience to model them all.
All geostatistical simulation methods are designed to produce random spatial fields, where the value at each location is produced by random draws from a probability distribution function defined by the observations. In contrast to straightforward generation of random values, spatially random fields produce random values at each location, but while preserving spatial structure. Individual simulations are much less smooth than kriging interpolation, as the values at any two neighboring locations are randomly chosen, but are spatially correlated as described by a variogram.
Geostatistical simulations come in two forms: constrained and unconstrained. In the unconstrained type, the random field is based on a specified mean and variance, and the variogram for spatial structure. The random fields produced have the same statistical and spatial characteristics, but the minima and maxima may occur anywhere in the study area. Constrained simulations also include the location and value of the observed points. This ensures that minima and maxima occur where they are defined by the original points, and the resulting fields have the same pattern as the original data. We will concentrate here on the constrained type of simulation.
Like ordinary kriging, Gaussian simulation can be performed with continuous variables, and uses a similar setup to the kriging carried out above and in previous labs. We’ll use this with the Swiss precipitation data, so first make a variogram and fit a variogram model.
modNugget <- 0.05
modRange <- 100000
modSill <- 0.75
ppt.var <- variogram(lppt ~ 1, swiss.sf)
ppt.vgm1 <- vgm(psill = modSill,
"Sph",
range = modRange,
nugget = modNugget)
ppt.vgm2 <- fit.variogram(ppt.var, ppt.vgm1)
plot(ppt.var, ppt.vgm2, main = "Swiss precip. variogram")
To carry out 6 random simulations of the Swiss precipitation data, we use the krige() function again, with the spatial data, output grid, variogram, etc. The new parameter used here is nsim which controls the number of output simulations:
ppt.pred.sgs <- krige(lppt ~ 1,
swiss.sf,
swiss.dem,
ppt.vgm2,
nmax = 40,
nsim = 6)
## drawing 6 GLS realisations of beta...
## [using conditional Gaussian simulation]
The spplot() function is very useful for visualizing the output as it plots a grid of all required simulations:
my.pal <- brewer.pal(9, "Blues")
plot(ppt.pred.sgs,
col = my.pal,
main = "Swiss ppt (SGS)")
## downsample set to c(1,1,1)
In general, single simulations are only of interest to examine the degree of variation between observations (as opposed to kriging which provides smoothed interpolations). The power of the simulation approach is in producing a large number of possible realizations of a spatial field. This allows a better assessment of uncertainty at any location, as we can obtain not just the mean estimated value and a confidence interval, but the full probability distribution of interpolated values.
In the next bit of code, we will produce one hundred simulations of precipitation, then extract the predicted values for a single point and make this into a histogram. So first, re-run the krige function with a higher number of simulations:
ppt.pred.sgs <- krige(lppt ~ 1,
swiss.sf,
swiss.dem,
ppt.vgm2,
nmax = 40,
nsim = 100)
## drawing 100 GLS realisations of beta...
## [using conditional Gaussian simulation]
Next, we create a new point location to extract the simulated values.
newloc <- st_geometry(st_point(c(2650000, 1200000)))
st_crs(newloc) <- st_crs(swiss.dem)
plot(swiss.dem, reset = FALSE, axes = TRUE)
plot(newloc, pch = "x", cex = 3, col = 2, add = TRUE)
We use the function st_extract() for extracting points, and then plot the resulting values as a histogram:
newloc.ppt <- st_extract(ppt.pred.sgs, newloc)
newloc.ppt$ppt <- exp(newloc.ppt$var1)
hist(newloc.ppt$ppt,
breaks = 20,
col = "darkorange",
main = "Precipitation at (2650000, 1200000)",
xlab = "mm")
Note that as we are using the same formula notation as kriging, we could easily extend the basic simulation approach to include covariates to represent external drifts or trends.
The simulation approach can also be used for indicator interpolation. As in the previous example, we simply reuse the krige() function. In addition to the nsim parameter, we include a parameter indicators=TRUE to perform indicator kriging. In this case, rather than each simulation estimating a probability at each new location, the method estimates a binary value (presence or absence) by drawing from a binomial distribution. To simulate 6 realizations of the distribution of soil class 1:
s1.sis <- krige(I(soil == 1) ~ 1,
meuse,
meuse.grid,
s1.vgm,
nsim = 6,
indicators = TRUE,
nmax = 40)
## drawing 6 GLS realisations of beta...
## [using conditional indicator simulation]
plot(s1.sis)
## downsample set to c(0,0,1)
If we now run multiple simulations, we can assess uncertainty. In this case, we get 1000 simulations of the presence of soil type 1 at each location. To estimate probability, we then take the sum of all presences (1) and divide by the number of simulations (1000). This is stored in the output, and then can be plotted using spplot().
s1.sim <- krige(I(soil == 1) ~ 1,
meuse,
meuse.grid,
s1.vgm,
nsim = 1000,
indicators = TRUE,
nmax = 40)
## drawing 1000 GLS realisations of beta...
## [using conditional indicator simulation]
s1.prob <- st_apply(s1.sim, c(1,2), sum) / 1000
plot(s1.prob, main = "P(Soil == 1)")
In contrast to the indicator kriging, the output from this function provides true probabilities of presence, constrained to the range [0,1].
It may help to add other shapefiles to your maps for this exercise. The following code assumes that you have read in the ozone shapefile to a sf object called ozone.sf, and adds state outlines, lakes and cities using the Natural Earth datasets. The example here plots out the ozone concentrations, but you can adapt this pretty easily for the predicted values you obtain. Note that to run this code, you will need to make sure the tmap package is installed and download and unzip the following files from Canvas:
To overlay multiple shapefiles, we can make use of the reset and add arguments when plotting sf objects. Start by reading in the three extra layers:
ozone.sf <- st_read("../datafiles/mwozone/ozone.shp")
st_crs(ozone.sf) <- 4326
## Read data
states <- st_read("../datafiles/ne_50m_admin_1_states_provinces/ne_50m_admin_1_states_provinces_shp.shp")
lakes <- st_read("../datafiles/ne_50m_lakes/ne_50m_lakes.shp")
places <- st_read("../datafiles/ne_50m_populated_places/ne_50m_populated_places.shp")
Now we plot the layers. Each time we add the argument reset = TRUE to tell R to keep the same basic plot outline. For the second (and subsequent plots) we also set add = TRUE to prevent R from resetting the plot.
plot(st_geometry(ozone.sf), reset = FALSE)
plot(st_geometry(lakes), reset = FALSE, add = TRUE, col = "lightblue")
plot(st_geometry(states), reset = FALSE, add = TRUE)
plot(ozone.sf["ozone"], add = TRUE, pch = 16)
Better plots with multiple layer can be made by using the geom_sf function from ggplot2:
places <- cbind(places, st_coordinates(st_centroid(places)))
ggplot() +
geom_sf(data = lakes, fill = "lightblue") +
geom_sf(data = states, fill = NA) +
geom_sf(data = ozone.sf, aes(col = ozone), size = 2) +
scale_color_viridis_c() +
geom_label(data = places, aes(X, Y, label = NAME), size = 2.5) +
coord_sf(xlim = c(-94, -82), ylim = c(36, 45), expand = FALSE) +
theme_bw()
Alternatively you can use the tmap package to make a better plot:
library(tmap)
mybbox <- st_bbox(ozone.sf)
tm_shape(lakes, bbox = mybbox) +
tm_fill("lightblue") +
tm_shape(states) +
tm_borders() +
tm_shape(ozone.sf) +
tm_symbols(col = "ozone", size = 0.5, palette = "viridis") +
tm_shape(places) +
tm_text("NAME", size = 0.75)
pred.grid <- st_as_stars(mybbox,
xlim = c(-94, -82),
ylim = c(36, 45),
dx = 0.1,
dy = 0.1)
st_crs(pred.grid) <- 4326